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Abstract. Operational risk models commonly employ maximum likelihood 
estimation (MLE) to fit loss data to heavy-tailed distributions. Yet several de¬ 
sirable properties of MLE (e.g. asymptotic normality) are generally valid only 
for large sample-sizes, a situation rarely encountered in operational risk. In this 
paper, we study how asymptotic normality does-or does not-hold for common 
severity distributions in operational risk models. We then apply these results 
to evaluate errors caused by failure of asymptotic normality in constructing 
confidence intervals around the MLE fitted parameters. 

1. Introduction 

Maximum likelihood estimation (MLE) is a-if not the-standard method for fit¬ 
ting parametric distributions in operational risk models |AMA Group, 20T3| . Its 
widespread use is due in large part to properties that hold as the sample-size of 
loss data goes to infinity, namely, that MLE is a consistent, asymptotically nor¬ 
mal, and asymptotically efficient estimator. We focus here on the second property, 
asymptotic normality of MLE, and how this property relates operational value-at- 
risk (OpVaR) models. Informally, asymptotic normality of MLE means that the 
estimated parameters will be normally distributed about the true parameters with 
variance going to zero as the sample-size tends to infinity (greater detail will be 
given in Section]^. 

The assumption of sample-sizes approaching infinity, however, is hard to justify 
in operational risk, where capital estimates are driven by large and rare loss events. 
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thus making the asymptotic nature of asymptotic normality ground for concern in 
OpVaR models. The situation is summarized in [Embrechts et ah, 1997| , p. 318: 
“[AJlthough we have reliable numerical procedures for finding the MLE ..we are 
less certain about its properties, especially in the small sample case.” 

The challenge of small sample-sizes for MLE estimation is exacerbated by pres¬ 
sure to develop stand-alone OpVaR models. Rather than calculating a single Op¬ 
VaR for all of a given bank’s legal entities, and then sub-allocating this capital 
figure to the legal entities, some local regulators are asking for OpVaR models that 
calculate on the level of a given legal entity (or cluster of entities in a country). 
Hence the problem of fitting a heavy-tailed distribution to a relatively small number 
of high losses across a bank has been made more severe by effectively carving the 
bank-and its loss data-into even smaller pieces. 

In this first of two papers, we study how-if at all-asymptotic normality holds for 
heavy-tailed distributions common to OR modeling, and what this means for OR 
modeling. Specifically we 


(1) perform graphical and numerical tests of normality (Sections |3.1| and 3.2), 

(2) assess the normal approximation for parameter fitting confidence intervals 
(Section |3.3| ). 


The second point refers to using asymptotic normality to determine confidence 
intervals for parameter estimates. These confidence intervals are then used for 
goodness-of-fit tests when fitting tail distributions. Of course, these estimates are 
only reliable insofar as the parameter error is distributed as predicted by asymp¬ 
totic normality. In a second paper [Larsen, 20I6| , we investigate how MLE error 
translates into stability of OpVaR models (or a lack thereof). 

While this paper focuses on operational risk, the problem of estimating errors 
resulting from MLE fitting of heavy-tailed distributions is much more general. The 
heavy-tailed distributions we consider (Pareto, Weibull, lognormal, log-logistic and 
generalized Beta of the second kind) arise in many other contexts besides opera¬ 
tional risk, such as data networks, market models and insurance [Resnick, 2007 


Indeed, the type of OpVaR model considered here is commonly called the “actuar¬ 
ial approach,” and is characterized by modeling frequencies and severities of losses 
separately. 

Relating theoretical concerns to practice requires representative loss data. We 
use loss data from the recent ORX Association OpVaR stability study, in which 
participants were given loss data sets for 12 units of measure (UOM) consisting 
of anonymized loss data from member banks. Four UOMs were selected for this 
paper that spanned a range of loss profiles. Due to space considerations, analyses 
are given here for only one UOM (UOMl), with full results for this UOM and three 
others given in a separate appendix, [Lar sen, 

Main findings 

We now summarize our main findings. 


'20T5| . 


• In Sectionwe present the theory of asymptotic normality for the Pareto, 
Weibull, lognormal, log-logistic and GB2 distributions, and show how the 
typical assumptions of asymptotic normality are difficult, if not impossible, 
to verify for these severity distributuions. 

• In Section |3.1| we test asymptotic normality graphically for small sample 
sizes via simulated MLE parameters (parametric bootstrapping). These 
tests show poor results for the Weibull and GB2 distributions, while the 
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Pareto, lognormal and log-loglogistic distributions seem to display asymp¬ 
totic normality, even with small samples sizes. 

• Numerical tests of asymptotic normality are studied in Section [3.2[ where 
the resulting p-values indicate the lognormal distribution is only one for 
which the normality assumption is not completely rejected. 

• Despite the departures from normality that can be seen graphically and nu¬ 
merically, asymptotic normality gives a remarkably good approximation to 
MLE confidence intervals for the Pareto, lognormal and log-logistic distribu¬ 
tions (Section |3.3[ ). The performance for the Weibull and GB2 distributions 
is quite poor. 


The use of asymptotic normality to estimate MLE error is common in practice, 
and is also used in [Frachot et ah, 20041 [Cope et ah, 2009| to estimate parameter 
confidence intervals. Possible shortcomings of this approximation are mentioned in 
[Piacenza and Sordi, 2014] . 

To conclude this section, we mention a few technical points. For the probability 
distributions considered below, different sources give different names for parameters, 
and sometimes the distribution functions themselves vary from source to source. We 
give precise definitions in the next section, and generally follow parameter naming 
conventions as in the R packages below that were used for our analyses. 

The statistical analyses, graphics and typesetting were all done via the statistical 
software R |R Core Team, 2014|. The R packages beyond the ones in the base set-up 


are f itdistrplus Delignette-Muller et ah, 2013j, GB2 Graf and Nedyalkova, 2014|, 


ggplot2 [Wickham, 2009| , neldermead Bihorel and Baudin, 2014[ , VGAM [Yee, 2014[ , 
MVN [Korkmaz et ah, 2014[ , and Sweave Leisch, 2002 Leisch, 2003[ . 


2. OpVaR, ASYMPTOTIC NORMALITY OF MLE AND HEAVY-TAILED 

DISTRIBUTIONS 

2.1. Asymptotic normality of MLE. Informally, asymptotic normality of MLE 
says that the distribution of fitted parameters to data will be normally distributed, 
centered about the true parameters, with a prescribed covariance matrix that de¬ 
pends on the sample-size. Let X = (xi,... ,x„) be data from an underlying dis¬ 
tribution with probability density function (PDF) /(x|6**), where 0* are the true 
parameters. Then to test if asymptotic normality holds for this distribution, we 
need to test if, as n increases, MLE yields fitted parameters 0 that are normally 
distributed. This property can be tested with parametric bootstrapping, that is, 
for a fixed n, we sample n data points from the true distribution /(xjff*), and 
apply MLE to get We repeat this sample/fit procedure m times to obtain 

fitted parameters (0i,n, ■ ■ ■, 0m,n), i-e. we have generated statistics for MLE fitted 
parameters for a sample size of n. 

Before giving a precise statement of asymptotic normality for MLE, we first 
chronicle the regularity assumptions required for its proof. Define the log-likelihood 
function of the distribution as i{x\0) = \ogf{x\0). The notation Eg* [(/(a:|6*)] for a 
function g{x\0) means 

Eg* b(a;|6»)] = J g{u\0*)f{u\0*)du. 

Then the usual regularity conditions are [Cox and Binkley, 1979[ [Greene, 20TT| 
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( 1 ) 

( 2 ) 

(3) 


The first three partial derivatives of i{x\9) with respect to 9 are continuous 
and finite for almost all x and for all 0 in a neighborhood of 9*. 

2 , 


For all 9j, and i = 1 


Eg* 


d^(.{x\9) 


< oo. 


[ m) 

There exists a distribution function M{x) such that 
all 6* in a neighborhood of 9*, and Eg* [M{x))] < oo. 


d3i{x\0) I 

del I 


< M{x) for 


The Fisher information matrix for a /c-parameter distribution is the kx k matrix 
whose (*, j) entry is 


(I{9*)),^, = Eg* 


d_ 


e{x\9) 


d_ 

d9. 


iix\9) 


Given the regularity conditions above, the Fisher information matrix admits the 
following simpler description: 


( 2 . 1 ) 


= -Eg* 


d9,d9 


-t[x\9) 


A further requirement for asymptotic normality is identifiability (this require¬ 
ment is usually invoked when discussing the consistency of MLE, which states that 
the estimated parameters converge to the true parameters in probability as the 
sample-size goes to infinity). Informally, a parametric distribution family is identi¬ 
fiable if the parameter 9 uniquely determines the distribution (i.e. no two different 
parameter values yield the same distribution). More precisely, a distribution family 
is identifiable if for any 9i 92, there exists X = (a;i,..., Xn) for some n such that 
/(X|di)^/(X|02). 

Proving identifiability can be challenging, but if the moments of f{x\9) have a 
nice form, one strategy is as follows: Suppose for a contradiction that 9 ^ 9', and 
X ~ f{x\9), X' ~ f{x\9'). If further there exists a A: G N such that E[X^] ^ E[X'^], 
then it follows that there exists a subset of the parameters space U of non-zero 
measure such that fix\9) f{x\9') for all x G U, and hence the distribution family 
is identifiable. 

A further requirement is that the Fisher information matrix be non-singular in 
a neighborhood of 9*. The study of when this condition fails has led to recent 
interaction between statistical learning theory and geometry, since the Fisher in¬ 
formation matrix can be interpreted as a metric on the parameter space. Work of 
Sumio Watanabe and others develops a theory reconstructing many of the desirable 
properties of MLE in the case of singular Fisher information matrices by resolution 
of singularities from algebraic geometry [Watanabe, 2009 Watanabe, 2013| . 

The final requirement for asymptotic normality to hold is that the model has to 
be correctly specified: if MLE is applied for one parametric distribution to fit data 
coming from a different distribution, then of course results appealing to the “true” 
parameters will be suspect. 


Theorem 2.1 (Asymptotic normality of MLE). Under the conditions above, the 
MLE 9 is asymptotically normal: 

3/n(9-9*)A N{0,I{9*)-^), 
where convergence is in distribution. 















OPERATIONAL RISK MODELS AND ASYMPTOTIC NORMALITY 


5 


A proof of the theorem can be found in [Wald, 1943| ; sketch proofs are more abun¬ 
dant (see e.g. |Cox and Hinkley, 1979| ). 

Our main interest is in the asymptotic nature of this result. For an example 
of what can go wrong for finite sample-sizes, consider data (xi,..., Xn) sampled 
independently from the normal distribution N(fj,,a). Then MLE produces the 
estimator for the variance — /t)^, which is biased for finite n. 

This theorem gives a natural interpretation of the Fisher information matrix, 
which informally encodes how much information about the distribution is contained 
in each of the parameter directions. For simplicity, assume that the Fisher infor¬ 
mation is diagonal. Then large entries in the Fisher information matrix (high levels 
of information) correspond in Theorem 2.1 to small variations for MLE parame¬ 
ter estimation. In fact, a standard method to estimate MLE variance in numerical 
solvers is to calculate the Eisher information matrix at the optimal parameters, and 
invert it as in Theorem |2.1| As a corollary, such variance estimates are in general 
only valid insofar as Theorem |2.1| applies, in particular, under the assumption of 
large sample-sizes. 

The higher-order regularity conditions above can be challenging to check in 
practice, and lack an obvious statistical interpretation. It can happen that the 
conditions of Theorem 2.1 are not satisfied, yet asymptotic normality still holds 
|Le Cam, 1970[ [Smith, 1985| . Moreover, if the Fisher information matrix is singu¬ 
lar (and not identically of determinant 0), then the set of parameters for which it 
is singular is of co-dimension at least one in the space of parameters (this is the 
solution set of det I{9) = 0). Hence for almost all parameters 9, the Fisher informa¬ 
tion matrix will be non-singular. Particular care is thus warranted when applying 
MLE to the generalized Beta distribution of the second kind, described in the next 
section, since the assumptions about the Fisher information matrix are difficult to 
verify. 

A further challenge in bridging theory and practice is that for all but a few dis¬ 
tributions, the algorithms used to determine the optimal parameters are numerical, 
and may produce only a local maximum of the log-likelihood function. As we de¬ 
scribe the severity distributions under consideration below, we will thus also sketch 
the algorithms used in MLE and their potential shortcomings. 


2.2. Heavy-tailed distributions. The severity distributions used in OpVaR mod¬ 
els are generally heavy-tailed. We will take heavy-tailed to mean subexponential 
(definition below), but several other definitions exist in the literature. For the con¬ 
venience of the reader, we also sketch other common definitions and, where possible, 
relate them to one another. 


Definition 2.2. Let F be a cumulative distribution function with support in (0, oo). 
Then F is suhexponential if, for all n > 2, 


lim 

X—>CCi 


F^\x) 


—. . = n, 


Fix) 


where Fix) = 1 — Fix) is the tail, or survival function, and the numerator in the 
limit is the n-fold convolution of Fix). 


Subexponential distributions exhibit one of the general properties expected of heavy¬ 
tailed distributions on the level of aggregate losses, namely that the tail of the max¬ 
imum determines the tail of the sum. All of the distributions considered here are 
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subexponential (for the Weibull distribution, this holds when the shape parameter 
is less than one; see Section 2.2.2 below). 


Subexponentiality implies another property that is sometimes taken as the defi¬ 
nition of heavy-tailed, namely that the tail decays more slowly than any exponential 
function. With the notation as above, the precise formulation is that for all e > 0, 

lim e‘^^F{x) = oo. 

x—>-oo 

See [Embrechts et ah, 1997| , Lemma 1.3.5 (b) for the proof that a subexponential 
distribution function satisfies the above limit. 

An important subclass of subexponential distributions consists of regularly vary¬ 
ing functions: 

Definition 2.3. A positive, Lebesgue measurable function / on (0, oo) is regularly 
varying at oo with index a € K if 

fitx) 


lim 


fix) 


= t° 


for alH > 0. 


Note that the lognormal and Weibull distributions are not regularly varying. 

For a regularly varying F with tail index a > 0, all moments of the associ¬ 
ated random variable higher than a will be unbounded; see [Embrechts et ah, 1997| 
Proposition A.3.8 (d). Hence regular variation implies one final characterization of 
heavy tails. Some sources differentiate between heavy and “light’-tailed distribu¬ 
tions based on the existence of finite moments jPe Fontnouvelle et ah, 2007| . Under 
this classification, the lognormal and Weibull distributions are light-tailed since all 
moments are finite, while the Pareto, log-logistic and GB2 distributions all have 
infinite moments, and are considered in this usage to be heavy-tailed. 

One subclass of severity distributions we do not consider below arises from Ex¬ 
treme Value Theory, such as the the Generalized Extreme Value (GEV) distri- 
butionj^ These distributions are generally not calibrated via MLE, but rather 
with methods from Extreme Value Theory, such as Peaks-Over-Threshold (see 
[Embrechts et ah, 1997| ). Furthermore, consensus seems to have turned against 
EVT in operational risk (see e.g. [Mignola and Ugoccioni, 2005 for stability con¬ 
cerns with EVT). We thus limit our distributions to heavy-tailed distributions for 
which MLE is a prominent fitting method. Besides Extreme Value Theory distri¬ 
butions, we also pass over the g-and-h distribution, despite recent attention in the 
literature, since there is no closed-form for its PDF, and is most naturally fitted to 
data by quantile-matching Dutta and Perry, 2007[ (MLE fitting methods do exist, 
however [Rayner and MacGillivray, 2002[ ). 

The supports of the distribution families considered here may vary, which poses 
a problem when fitting loss data, especially in the case of a spliced distributions 
we study. For the Pareto distribution, one of the parameters defines the support, 
which contradicts an assumption required for the proof of asymptotic normality of 
MLE. Hence we set the parameter to the splice location, T, thus making the Pareto 
distribution a one-parameter family. 

For the lognormal, log-logistic and Generalized Beta distribution of the second 
kind, the support is the positive real numbers. For fitting a spliced distribution. 


^We implicitly treat the Generelized Pareto Distribution (GPD), since for heavy-tailed distri¬ 
butions the GPD reduces to the Pareto distribution. 
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there are two standard approaches: replace the distribution with either the shifted 
or the truncated version to ensure that the support is contained in the tail region 
of the spliced distribution. Truncated distributions can pose difficulties for MLE 
fitting [Opdyke and Cavallo, 2012] , plus explicit expressions for the resulting Fisher 
information matrices are more complicated, though in principle possible to obtain 
lEscobar and Meeker, 1998] . For these reasons, we consider exclusively shifted ver¬ 
sions of these distributions. 

2.2.1. Pareto Distribution. As mentioned above, the Pareto distribution is typically 
defined as a 2-parameter family, but since its support depends on one parameter 
(typically called the scale parameter, T), we fix this as the threshold of our spliced 
severity distribution (e.g. T = 100000), and consider the Pareto distribution as 
depending on one parameter, the shape, a, resulting in PDF 


f{x\a) = 


aT°‘ 

rra+l • 


where x >T, and is 0 otherwise. For X ~ Pareto{a), note that the first moment 
of X is bounded if and only if a > 1, and the variance is bounded only for a > 2. 

It is easy to show that the Pareto distribution is identifiable for all values of 
a, and the Fisher information matrix is the scalar I {9) = I {a) = 1/a^, which is 
indeed a positive-definite matrix (of size 1x1). The unique solution to the likelihood 
equation = 0 is 


YrMx,/Ty 


hence no numerical solver is required to perform MLE for the Pareto distribution. 


2.2.2. Weibull distribution. The Weibull distribution is a generalization of the ex¬ 
ponential distribution. For shape and scale parameters a,b > 0, the PDF is 

f{x\a,b) = {a/b){x/by~^ exp (-) 

In I Wei, 2007| , the three-parameter Weibull distribution is considered, with an 
extra location parameter u that determines the support. As noted above with 
reference to the Pareto distribution, a key assumption of MLE is that the support 
is independent of the parameters to be estimated. We thus consider the shifted 
distribution for Weibull, which is equivalent to setting the location parameter to 
u = T. 

The Fisher information matrix for the Weibull distribution is 


( 2 . 2 ) 




(,2 , 


see [Gupta and Kundu, 2006| , but note the parametrization: some sources define 
the scale parameter as the reciprocal of what is given here. 

The MLE properties of the Weibull distribution have been studied in [Smith, 1985| 
[Woodroofe, 1972][Akahira, 1975[ , although these works consider the three-parameter 
Weibull distribution for which MLE is especially problematic. It is shown that 
MLE is not even consistent if a < 1. For 1 < a < 2, MLE is not asymptoti¬ 
cally normal, while if a = 2, MLE is asymptotically normal, but with different 
covariance matrix than that of Theorem [2.1[ If a > 2, asymptotic normality holds 
(as well as asymptotic efficiency). Note that the Weibull distribution is heavy¬ 
tailed (i.e. subexponential) if and only a < 1 ([Embrechts et ah, 1997[, Example 
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1.4.3 for the if statement, while reverse implication follows from the existence of a 
Cramer-Lundberg exponent when a > 1), hence MLE for subexponential Weibull 
distributions results in an inconsistent estimator. 

The likelihood equations for the Weibull distribution can be solved explicitly. As 
will be seen in Section]^ (and [Larsen, 2015| ), for all four loss data sets to which 
we apply MLE, the “true” values of a are all less than one, leading to non-bounded 
Fisher information matrices. This manifests in the algorithm of fitdistrplus via 
warning messages that the resulting system is singular. In case of non-convergence 
of MLE, we discard the parameter estimate. 


2.2.3. Lognormal Distribution. The lognormal distribution \ogAf{fJ.,a) has PDF 


f{x\n,a) = 


1 


exp 


{\ogx - ^ly■‘ 
2cr2 


\f^(TX 

Note that we choose the second parameter to be cr and not . This convention 
makes a difference in calculating the Fisher information matrix, which is 


(2.3) 


i{e) = = 


l/c 


0 


0 2/a‘ 


The Fisher information matrix of the lognormal distribution is non-singular, since 
a random variable X is lognormal if and only if there is a normally distributed 
random variable Y with X = exp(F), and the function exp : K (0, oo) is a 
diffeomorphism (hence, loosely speaking, all properties involving derivatives and 
integrals that hold for normally distributed variables hold for lognormal, and vice- 
versa) . 

The lognormal distribution is also identifiable for all allowed {n,cr), since the 
same holds for the normal distribution. The Fisher information matrix is positive- 
definite, since it is a diagonal matrix with positive entries. As with the Pareto 
distribution, the likelihood equations for the lognormal distribution can be solved 
explicitly, hence the determination of ()2, a) is computationally unproblematic. 


2.2.4. Log-logistic Distribution. As with the lognormal distribution, a random vari¬ 
able X follows a log-logistic distribution if and only if there exists a logistic random 
variable Y such that X = exp(P). The log-logistic distribution LL{a,s) has PDF 


(2.4) 


fix\a,s) = 


[lY 


a;(l -I- {x/sYY 


for a; > 0, and is zero otherwise. The parameters a and s must both be positive. 
Like the Pareto distribution, a log-logistic distributions can have an unbounded 
first moment, namely, when a < 1, while for a < 2, the variance is also unbounded. 
The Fisher information matrix of AT ~ LL(a, s) is (jShoukri et ah, 1988|) 


(2.5) 


I{9) = I{a,s) 


( 3-L7r' 
I 9a2 

V 0 



which is positive-definite, since a, s > 0. Moreover, the log-logistic distribution 
is identifiable for a > 1, which follows from the above general strategy, since its 

median is s and its mode is s ( 1 , which is strictly increasing in a, and is 

hence injective. 
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To fit the log-logistic distribution, we use f itdistrplus, which performs opti¬ 
mization with the Nelder-Mead algorithm. For initial parameter values, we take 
(see e.g. [Johnemark, 2012|) 


Sinit = Median (ail ,... ,Xn) 

a ^ ^ log(n - 1 ) 

log(max(a;i,... 


2.2.5. Generalized Beta Distribution of the Second Kind. The GB2 distribution 
(also known as the transformed Beta distribution) is nested within the more general 
Feller-Pareto distribution FP(^, tr, 7,71, 72 ), and itself nests the Weibull, lognor¬ 
mal, and log-logistic distributions (as well as the generalized Pareto and inverse 
Burr distributions [Brazauskas, 200^ ), hence the GB2 distribution makes possible 
an evaluation of the trade-off between generality (GB2) and parsimony (Weibull, 
lognormal and log-logistic) when modeling OR loss data. 

The GB2 distribution has PDF 


( 2 . 6 ) 


^ bB{p,q){l + ix/b)‘^)P+i’ 


where B(p, q) is the Beta function (or Euler integral), defined for p, g > 0 as 


(2.7) 


B{p,q)= [ tP ^{l-t)q- Idt, 
Jo 


The mth moment of of X ~ GB2{a, b,p, q) is ^ moments 

are finite above the agth one [Bookstaber and McDonald, 1987| . 

The Fisher information matrix for the GB2 distribution can be derived from that 
of the Feller-Pareto distribution [Brazauskas, 2002) . Specifically, since GB2(a, 6 ,p, q) 
FP{0,b,l/a,q,p), we use the change-of-variable formula JIppJ^, where J is the 
Jacobian matrix of the coordinate change /r — >■ 0, cr — >■ 5 , 7 — >■ 1/a, 71 — g, 
and 72 — >■ p. Writing I = (lij) = the Fisher information matrix for the 
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GB2{a,b,p,q) distribution has entries 


^ 1.1 

11 .2 

11.3 

hA 

12.2 
h.s 
I 2 A 

13.3 
I 3 A 
I 4 A 


2 , 

a H-- 

p + q + 1 

pq ii^jp) - ipiq)) + q - p 

b{p + q- 1 ) 

q j-ipip) - Hq)) -1 

a{p + q) 

p{ip{q) - ip{p)) - 1 
a{p + q) 
aPpq 

52 (p + g + 1) 

aq 

b{p + q) 

—ap 
b{p + q) 

ijj'{p) - ijj'{p + q) 

-p^'ip + q) 

■</''(9) -'0'(P + 9 ), 


where = r'(a:)/r(x) is the digamma function, and its derivative tp'px) is the 
trigamma function. 

We implement our own MLE for the GB2 distribution as follows. We incor¬ 
porate the linear bounds on p, q by implicit penalty in the likelihood function, 
and minimize the negative log-likelihood function with the package NelderMead 
[Bihorel and Baudin, 2014] . To obtain initial parameter values, we use the pseudo- 
MLE functionality of the package GB2 [Graf and Nedyalkova, 20T^ . Since Nelder¬ 
Mead in general only returns a local minimum, we run the minimization at two 
other parameter start values, obtained by perturbing the loss data (losses shifted 
up for one, and shifted down for the other) and applying pseudo-MLE to the per¬ 
turbed loss data. If at least one of the three start parameters leads to a convergent 
solution, we take the calibrated parameters corresponding to the lowest negative 
log-likelihood value. 

Before turning to results, note that Theorem |2.1| assumes that we are able to 
find the global maximum of the likelihood function. For both the log-logistic and 
GB2 distributions, there is no guarantee of having found a global maximum. We 
hope that the descriptions of our methodology above will suffice to enable practi¬ 
tioners using the log-logistic or GB2 distributions to judge how their optimization 
algorithms differ. 


3. Asymptotic normality for OpVaR severity distributions 


In this section, we first evaluate asymptotic normality both graphically and nu¬ 
merically for the severity distributions described above when fitted to moderately 

respectively. In Section |3.3[ we examine 


heavy loss data in Sections |3.1| and |3.2[ 
what implications this has for approximating parameter confidence intervals with 
asymptotic normality. The corresponding results for three other sets of loss data 
can be found in [Larsen, 2015| . 
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To set the stage, we describe our simulation procedure (parametric boostrapping) 
in detail for a fixed loss data set losses and sample size n 

For distn in {pareto, Weibull, lognormal, log-logistic, GB2} with CDF F{x\9) 

(1) Fit distn to losses with MLE to obtain true parameters 9* 

(2) For z in {1,..., m} (we take m = 40, 000) 

(a) Draw n samples from the true distribution F{x\9*) to obtain boot¬ 
strapped losses losses^ 

(b) Fit distn to losses^ with MLE to obtain bootstrapped parameters 


For each distribution family and each sample-size n we thus obtain statistics 
for parameter estimation. We then compare each co mpon ent of the boostrapped 
parameters ..., 0m,n to the prediction of Theorem |2.l[ For example, each 0^ „ 
for a lognormal distribution will be a vector with two components, 9i^n = cri,n), 
and for each of these components we plot the corresponding kernel density estimate 
(essentially a smart histogram; see e.g. [Hastie et ah, 2009] , Chapter 6) against 
the normal distribution of Theorem |2 .1 1 for this component. Continuing with this 
example, the normal distribution corresponding to the bootstrapped /ii_„ will have 
variance 1 /(ct*)^. (For readibility, in the plots below we center the predicted normal 
distribution at the true value rather than 0, and move the factor of y/n to the right- 
hand side of the limit). Note that a comprehensive study of applicability of Theorem 
2.1 would require more than the plots and conhdence intervals we present, which 


only focus on the marginal distributions of the estimated parameters. 

For the below example, the underlying loss data set consists of losses, 19% of 
which are above the splicing threshold of 100,000 EUR. The data are not particu¬ 
larly heavy, with mean of 131560, median of 39018, and no losses larger than 30m 
EUR. Results for heavier loss data are found in [Larsen, 2015|. 


3.1. Graphical tests of asymptotic normality. The graphical tests of normal¬ 
ity of Eigures a-izi show how variablility in how normal the marginal parameter 
distribituions appear. On the ‘normal’ side, the bootstrapped parameters behave 
as in Theorem |2.1| even for sample-size as small as 100 for the Pareto (Eigure 
and lognormal distributions (Figure Nevertheless, for the Pareto distribution, 
the p-values tell a different story (see below), and for the lognormal distribution, 
the small-sample bias of the MLE estimator for a of the lognormal distribution can 
be seen). 

Among the remaining distributions of Weibull, log-logistic and GB2, each shows 
varying degrees of skewness; see Eigures[^[^|^[^ respectively. The skewness for the 
Weibull distribution does not decrease with increasing sample-size, as it does for the 
others. This behavior is not surprising, however, since the true shape parameter is 
0.56, i.e. the MLE is not consistent for this value, let alone asymptotically normal. 
The GB2 distribution for 100 tail losses shown in Figure gives by far the worst 
match to asymptotic normality. 


3.2. Numerical tests of asymptotic normality. The graphical tests above con¬ 
sider only the marginal boostrapped parameter distributions (i.e. is each MLE- 
fitted parameter indidually normally distributed as predicted by theory?). To fully 
test Theorem |2.1[ we turn to numerical tests of normality. 

Hypothesis testing for normally distributed data has its merits and demerits 
as a modeling tool, but it is nevertheless a feature of the regulatory statistical 
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landscape. The current task of assessing the validity of Theorem |2.1| based on 
simulated data is quite different from the usual situation of having real-world data. 


Bootstrapped shape: N(1.12, 0.113) 



shape 


Figure 1. UOMl: AN for Pareto: true 9 = (1.11), sample-size 100 


Bootstrapped shape: N(0.569, 0.045) 
vs theoretical N{0.56, 0.0415) 


Density plots 

B Empirical (Kernel Density) 
Theoretical 



Bootstrapped scale: N(217751,41073) 
vs theoretical N(212303, 37902) 


Density plots 

0 Empirical (Kerne) Density) 
Theoretical 



shape 


2e+05 3e+05 

scale 


Figure 2. UOMl: AN for Weibull: true 9 = (0.56, 212303.18), 
sample-size 100 



Bootstrapped scale: N(214799, 9547) 
vs theoretical N{212303, 7580) 


46-05- 


Oe+OO- 


Density plots 

B Empirical (Kernel Density) 
Theoretical 



200000 225000 250000 

scale 


Figure 3. UOMl: AN for Weibull: true 9 = (0.56, 212303.18), 
sample-size 2500 
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In the real-world, it is safe to assume that no large data set is truly normally 
distributed, thus diminishing the value of normality tests for large samples sizes. In 
the present context, we also know that our 40000 bootstrapped parameters are not 
normally distributed, but if Theorem 2.1 holds, then increasing sample sizes (i.e. 
assumed size of loss data) will lead to fewer rejections of the normality hypothesis, 
or, equivalently, bigger p-values. 

For the Pareto distribution, we use the Anderson-Darling test as implemented 
in nortest [Gross and Ligges, 2015| . This test has the advantage of working for 
all sample-sizes, unlike the Shapiro-Wilk test. All p-values except for 2500 samples 
are indistinguishable from 0 (i.e. smaller than 10“^^). For a sample-size of 2500, 
the p-value is 2.28e — 09. 

For the remaining multivariable distributions, we use the Mardia test [Mardia, 1970| 
as implemented in MVN [Korkmaz et ah, 2014| . Unlike the Anderson-Darling test. 
The Mardia test checks skewness and kurtosis separetely. The only distribution 
with skew p-values greater than 10“^^ is the lognormal distribution. Its p-values 
are shown in table where we see that the typical 5% significance level would not 
reject the normal hypothesis for samples sizes 1500 and 2500. 


Bootstrapped meanlog: N(11.3, 0.18) 



10.5 1l'.0 11.5 ^2.0 

meanlog 


Figure 4. UOMl: AN for lognormal: true 9 = (11.3, 1.8), 
sample-size 100 



CO 

c 

Q) 

■02 


Bootstrapped shape: N(1.01, 0.0867) 
vs theoretical N(1,0.0837) 



0.75 1.00 1.25 1.50 

shape 


Bootstrapped scale: N(85388, 14958) 



60000 90000 120000 150000 

scale 


Figure 5. UOMl: AN for log-logistic: true 9 = (1, 84000), 
sample-size 100 
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The kurtosis test leads to essentially 0 p-values with the GB2 distribution. The 
kurtosis p-values for the other distributions are shown in Table A 5% significance 
level would fail to reject the normal hypothesis for the lognormal distribution across 
the whole range of sample-sizes considered here, and likewise for the log-logistic 
distribution for the 1000 and 1500 sample-sizes. 


Bootstrapped shapel: N(2.13, 6.58) 



shapel 


Bootstrapped scale: N(9.13e+10,1.2e+13) 



Density plots 

■Empirical (Kernel Density) 
^Theoretical 




Bootstrapped shape2: N{2.08, 8.34) 
vs theoretical N{1.18,1.23) 



shape2 


Bootstrapped shape3: N{252502, 33232661) 
vs theoretical N(1.45,1.65) 



shapes 


Figure 6 . UOMl: AN for GB2: true 9 = (0.837, 117516.887, 
1.184, 1.454), sample-size 100 


Bootstrapped shapel: N{0.869, 0.13) 
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Bootstrapped scale: N(118809,19350) 



16+05 26+05 3e+05 
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Bootstrapped shape2: N(1.17, 0.268) 



12 3 4 


shape2 


Bootstrapped shape3: N(1.44, 0.363) 
vs theoretical N(1.45, 0.33) 



2 4 6 


Shapes 


Figure 7. UOMl: AN for GB2: true 9 = (0.837, 117516.887, 
1.184, 1.454), sample-size 2500 
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The Weibull distribution is the one that most clearly fails the assumptions of 
Theorem |2.1[ so it is no surprise that the kurtosis p-values do not increase monoton- 
ically. The expected monotone increase of kurtosis p-values with increasing sample 
sizes is not apparent for the lognormal distribution. This behavior could result from 
numerical instability for such large data sets. 

Table 1. Mardia skew p-values per sample-size 


100 

200 

300 

500 

1000 

1500 

2500 

lognormal 0 

0.01 

0.03 

0.03 

0.17 

0.32 

0.26 


Table 2. Mardia kurtosis p-values per sample-size 



100 

200 

300 

500 

1000 

1500 

2500 

Weibull 

0.00 

0.00 

0.00 

0.00 

0.02 

0.00 

0.00 

lognormal 

0.70 

0.44 

0.88 

0.36 

0.81 

0.70 

0.57 

log-logistic 

0.00 

0.00 

0.00 

0.00 

0.55 

0.36 

0.00 


In contrast to the p-values presented in this section, the next section looks beyond 


whether or not Theorem 2.1 holds to study the validity of the commonly-used 
normal approximation to estimate parameter confidence intervals. 

3.3. Approximating parameter confidence intervals. We now turn to ques¬ 
tion from the introduction about the use of Theorem |2.1| to approximate param¬ 
eter confidence intervals. Several methods exist for evaluating the goodness-of-fit 
of a severity distribution to loss data. The one we focus on here follows directly 
from asymptotic normality. Assuming that Theorem |2.1| holds, then the MLE fit¬ 
ted parameters are normally distributed about the “true” (i.e. fitted) parameters, 
with covariance matrix determined by the Fisher information matrix (see Section 
2 .1). As we have seen already, however, this normal assumption is questionable for 
small-sample sizes of operational risk data. In this section, we quantify the resulting 
confidence interval error. 

The below table shows the percent error of using this approximation relative 
to the “true” 95% confidence intervals obtained by quantiles of the boostrapped 
parameters, i.e. we compare the difference of the 97.5% quantile and 2.5% quan¬ 
tile from 40,000 bootstrapped parameters to same difference of quantiles from the 
normal distribution dictated by Theorem |2.1| 

The results in Table mirror what can be seen from the plots: for the Pareto, 
lognormal and log-logistic distributions, the normally approximated 95% confidence 
intervals are within a few percent of the true ones, while the approximation is 
relatively poor for the Weibull and GB2 distributions. For GB2, the normally 
approximated confidence intervals are within 10% of the true ones given enough 
data (2500 losses). The approximation for the Weibull distribution gets worse as 
sample sizes increase, a phenomenon that can also be seen from Figures and 
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Table 3. Percent error of 95% confidence intervals derived from 
asymptotic normality by sample-size 



100 

200 

300 

500 

1000 

1500 

2500 

Pareto shape 

2 % 

1 % 

0 % 

0 % 

1 % 

0 % 

-1% 

Weibull shape 

8 % 

7% 

7% 

6 % 

6 % 

5% 

6 % 

Weibull scale 

9% 

8 % 

9% 

10 % 

13% 

16% 

22 % 

lognormal meanlog 

0 % 

1 % 

0 % 

1 % 

0 % 

0 % 

0 % 

lognormal sdlog 

0 % 

0 % 

0 % 

0 % 

0 % 

-1% 

0 % 

log-logistic shape 

3% 

1 % 

0 % 

0 % 

0 % 

0 % 

-1% 

log-logistic scale 

2 % 

2 % 

1 % 

1 % 

0 % 

1 % 

0 % 

GB2 shapel 

73% 

31% 

21 % 

15% 

11 % 

8 % 

7% 

GB2 scale 

79% 

78% 

73% 

61% 

31% 

17% 

7% 

GB2 shape2 

43% 

50% 

50% 

45% 

26% 

15% 

6 % 

GB2 shape3 

43% 

55% 

58% 

53% 

30% 

17% 

7% 
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4. Conclusions 


The only severity distribution considered here for which asymptotic normality 
clearly holds-even for sample sizes typical of operational loss data-is the lognormal 
distribution. Recall that MLE for it, along with the Pareto distribution, can be 
performed analytically (i.e. solving a simple equation). More generally, all key as¬ 
sumptions of Theorem |2.1| can be effectively verified for the lognormal distribution. 
Asymptotic normality also gives very good approximations for MLE parameter 
confidence intervals, with errors no larger than 1% across all sample sizes. 

The Pareto and log-logistic distributions fare similarly well, with normally ap¬ 
proximated confidence intervals leading to errors no more than 3%. In contrast to 
the lognormal distribution, however, the Anderson-Darling p-values for the Pareto 
shape parameter indicated a rejection of the normal hypothesis for all sample sizes. 

The generalized Beta distribution of the second kind (GB2) fares the worst in 
all tests of asymptotic normality. When it comes to desirable MLE properties and 
confidence intervals, this distribution should only be used with great caution. 

A more interesting “under-performer” here is the Weibull distribution. In Section 

we show that a Weibull distribution is sub-exponential precisely when its MLE 
is inconsistent. The graphical and numerical tests of asymptotic normality bear 
out this problem. The normal approximation error for MLE confidence intervals 
even gets worse as the sample size increases for the scale parameter. Although 
these analyses give clear warning signals about the MLE properties of Weibull as 
an operational risk severity distribution, we will show in [Larsen, 2016| that its 
stability properties as a function of sample size are good when compared to the 
other distributions. 

The generally positive results for using asymptotic normality to approximate 
parameter confidence intervals should be taken also in the broader context opera¬ 
tional risk modeling, with particular detail to the assumptions our simulation study 
makes and how this relates to real-world loss data. 

Parametric bootstrapping is by definition compatible with one of the key as¬ 
sumptions of MLE, namely that data are independent and identically distributed, 
since the bootstrapped data samples are drawn independently from a single “true” 
distribution. That these assumptions hold for actual loss data has been questioned, 
and the resulting impact for loss data that are not independent and identically dis¬ 
tributed on MLE is a focus of jOpdyke and Cavallo, 2012] . Operational risk litera¬ 
ture is not unanimous on this question, but the regular assessments of the ORX loss 
data consortium offer general backing for this assumption 'Cope and Antonini, 2008 
[Analytics, 2012 . Besides the assumption of large sample-sizes, we investigate as¬ 
sumptions on distribution being fitted. It is, however, well-known that MLE can 
exhibit asymptotic normality even when the assumptions of the theory are not 
met (see e.g. [Smith, 1985[ ). Loosely put, the assumptions are necessary to prove 
the result, but not necessarily for the result to hold. Eor three of the distributions 
(Pareto, lognormal, and log-logistic), we see high levels of agreement between theory 
and simulation, even for very small sample-sizes. The assumptions of asymptotic 
normality for the generalized Beta distribution of the second kind are difficult to 
verify. 

We have also assumed implicitly the standard MLE requirement that we know 
the “true” underlying distribution. Robustness of a fitting method to misspecified 
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models (i.e. selecting the wrong “true” distribution) is itself a topic of research; see 
e.g. [Ergashev, 2008[[0pdyke and Cavallo, 2012| . 
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